Weakly-supervised temporal action localization (WTAL) learns to detect and classify action instances with only category labels. Most methods adopt off-the-shelf Classification-Based Pre-training (CBP) to generate video features for action localization. However, the different optimization objectives of classification and localization cause the temporal localization results to suffer from serious incompleteness. To tackle this issue without additional annotations, this paper considers distilling free action knowledge from Vision-Language Pre-training (VLP), since we surprisingly observe that the localization results of vanilla VLP have an over-complete issue, which is exactly complementary to the CBP results. To fuse such complementarity, we propose a novel distillation-collaboration framework with two branches acting as CBP and VLP respectively. The framework is optimized through a dual-branch alternate training strategy. Specifically, during the B step, we distill confident background pseudo-labels from the CBP branch, while during the F step, confident foreground pseudo-labels are distilled from the VLP branch. As a result, the dual-branch complementarity is effectively fused to form a strong alliance. Extensive experiments and ablation studies on THUMOS14 and ActivityNet1.2 reveal that our method significantly outperforms state-of-the-art methods.
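The alternate B/F distillation can be illustrated with a minimal numpy sketch (the function name, thresholds, and per-snippet score format are our assumptions, not the paper's implementation): CBP tends to under-cover actions, so its low scores mark confident background, while VLP tends to over-cover, so its high scores mark confident foreground.

```python
import numpy as np

def distill_pseudo_labels(cbp_scores, vlp_scores, bg_thresh=0.2, fg_thresh=0.8):
    """Illustrative fusion of the two branches' complementary outputs.

    cbp_scores, vlp_scores: per-snippet foreground probabilities in [0, 1].
    Returns per-snippet pseudo-labels: 1 = foreground, 0 = background,
    -1 = uncertain (ignored during training).
    """
    labels = np.full(cbp_scores.shape, -1, dtype=int)
    # B step: CBP under-segments, so its low scores give confident background.
    labels[cbp_scores < bg_thresh] = 0
    # F step: VLP over-segments, so its high scores give confident foreground.
    labels[vlp_scores > fg_thresh] = 1
    return labels
```

In this sketch, snippets labeled -1 would simply be ignored when supervising the other branch, and a confident VLP foreground vote overrides a conflicting CBP background vote.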
The efficient segmentation of foreground text from the background in degraded color document images is an active research topic. Because ancient documents are imperfectly preserved over long periods, various types of degradation, including staining, yellowing, and ink seepage, seriously affect the results of image binarization. In this paper, a three-stage method is proposed for image enhancement and binarization of degraded color document images, using the discrete wavelet transform (DWT) and generative adversarial networks (GANs). In Stage-1, we apply the DWT and retain the LL subband images to achieve image enhancement. In Stage-2, the original input image is split into four single-channel images (red, green, blue, and gray), each of which is used to train an independent adversarial network. The trained adversarial network models are then used to extract the color foreground information from the images. In Stage-3, in order to combine global and local features, the output image from Stage-2 and the original input image are used to train independent adversarial networks for document binarization. The experimental results demonstrate that our proposed method outperforms many classical and state-of-the-art (SOTA) methods on the Document Image Binarization Contest (DIBCO) datasets. We release our implementation code at https://github.com/abcpp12383/ThreeStageBinarization.
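Stage-1's LL-subband enhancement can be sketched with a hand-rolled single-level Haar transform (an illustrative stand-in for the paper's DWT step; in practice a wavelet library would be used, and the function name is ours):

```python
import numpy as np

def haar_ll_enhance(img):
    """Single-level Haar DWT keeping only the LL subband (a sketch of Stage-1).

    Dropping the detail subbands before inverting the transform acts as a
    low-pass filter that suppresses high-frequency degradation such as fine
    stains and ink-seepage texture. img: 2-D array with even side lengths.
    """
    img = img.astype(float)
    # LL coefficients of the orthonormal Haar transform: 2x2 block sums / 2.
    ll = (img[0::2, 0::2] + img[0::2, 1::2] +
          img[1::2, 0::2] + img[1::2, 1::2]) / 2.0
    # Inverse Haar with zeroed detail subbands: upsample LL and halve.
    return np.repeat(np.repeat(ll, 2, axis=0), 2, axis=1) / 2.0
```

Keeping only LL and inverting reduces each 2x2 block to its mean, i.e. a mild low-pass smoothing of the degraded input.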
Optical coherence tomography (OCT) is a non-invasive technique that captures cross-sectional areas of the retina at micrometer resolution. It has been widely used as an auxiliary imaging reference to detect eye-related pathologies and to predict the longitudinal progression of disease characteristics. Retinal layer segmentation is one of the crucial feature extraction techniques, since changes in retinal layer thickness and deformations of the retinal layers caused by the presence of fluid are highly correlated with multiple prevalent eye diseases such as diabetic retinopathy (DR) and age-related macular degeneration (AMD). However, these images are acquired from different devices with different intensity distributions or, in other words, belong to different imaging domains. This paper proposes a segmentation-guided domain-adaptation method to adapt images from multiple devices into a single image domain for which a state-of-the-art pre-trained model is available. It avoids the time-consuming manual labeling of upcoming new datasets and the retraining of existing networks. The semantic consistency and global feature consistency of the network minimize the hallucination effects that many researchers have reported for cycle-consistent generative adversarial network (CycleGAN) architectures.
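The CycleGAN-style training the paper builds on centers around a cycle-consistency term, which can be written down directly (a generic sketch of that standard term, not the proposed semantic/global feature consistency losses):

```python
import numpy as np

def cycle_consistency_loss(x, G, F):
    """L1 cycle-consistency term from CycleGAN-style training (a sketch).

    G maps the source domain to the target domain and F maps back; the loss
    penalizes F(G(x)) drifting away from x, which is what discourages the
    translator from hallucinating content in the adapted images.
    """
    return np.abs(F(G(x)) - x).mean()
```

With toy linear "generators" G(x) = 2x and F(y) = y/2, the cycle closes exactly and the loss is zero.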
We present BlenderBot 3, a 175B-parameter dialogue model capable of open-domain conversation with access to the Internet and a long-term memory, which has been trained on a large number of user-defined tasks. We release both the model weights and code, and we have also deployed the model on a public web page to interact with organic users. This technical report describes how the model was built (architecture, model, and training scheme) and the details of its deployment, including its safety mechanisms. Human evaluations show that it is superior to existing open-domain dialogue agents, including its predecessors (Roller et al., 2021; Komeili et al., 2022). Finally, we detail our plan for continual learning using the data collected from deployment, which will also be publicly released. The goal of this research program is thus to enable the community to study ever-improving responsible agents that learn through interaction.
The tree pruning process is key to promoting fruit growth and improving production, owing to its effect on photosynthetic efficiency and on the transport of nutrients to the fruit along the branches. At present, pruning still depends heavily on human labor, and a worker's experience strongly influences the robustness of pruning performance. Evaluating pruning performance is therefore a challenge for workers and farmers. To better address this problem, this paper proposes a novel pruning classification strategy model, called OTSU-SVM, to evaluate pruning performance based on the shadows of branches and leaves. The model considers not only the lighted area available to the tree but also the uniformity of that lighted area. More importantly, we implement the OTSU algorithm in the model, which greatly enhances the robustness of its evaluation. In addition, data from pear trees in Yuhang District are used in the experiments. We demonstrate that OTSU-SVM achieves good accuracy, with 80% performance in evaluating the pruning of the pear trees. If applied to an orchard, it can enable more successful pruning, which can enlarge the lighted area of individual fruits and increase nutrient transport to the target branches, significantly improving fruit weight and production.
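Otsu's method, the thresholding step the model is named after, can be sketched in a few lines of numpy (an illustrative histogram-based implementation for 8-bit grayscale, such as separating shadow from lighted canopy area):

```python
import numpy as np

def otsu_threshold(gray):
    """Otsu's method (a minimal sketch): choose the threshold that maximizes
    the between-class variance of the intensity histogram.

    gray: array of integer intensities in [0, 255]. Returns the threshold t;
    pixels with value >= t fall in the bright (lighted) class.
    """
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0, w1 = prob[:t].sum(), prob[t:].sum()  # class weights
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (np.arange(t) * prob[:t]).sum() / w0        # dark-class mean
        mu1 = (np.arange(t, 256) * prob[t:]).sum() / w1   # bright-class mean
        var = w0 * w1 * (mu0 - mu1) ** 2                  # between-class variance
        if var > best_var:
            best_t, best_var = t, var
    return best_t
```

On a bimodal shadow/light histogram, the returned threshold lands between the two modes, cleanly separating the classes.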
We present a simple yet effective self-supervised framework for audio-visual representation learning that localizes sound sources in videos. To understand what enables learning useful representations, we systematically investigate the effects of data augmentation and reveal that (1) the composition of data augmentations plays a critical role, i.e., explicitly encouraging the audio-visual representations to be invariant to various transformations (transformation invariance); and (2) enforcing geometric consistency substantially improves the quality of the learned representations, i.e., the detected sound sources should follow the same transformations applied to the input video frames (transformation equivariance). Extensive experiments demonstrate that our model significantly outperforms previous methods on two sound-localization benchmarks, namely Flickr-SoundNet and VGG-Sound. In addition, we evaluate on audio retrieval and cross-modal retrieval tasks. In both cases, our self-supervised model shows superior retrieval performance, and is even competitive with supervised methods in audio retrieval. This reveals that the proposed framework learns strong multi-modal representations that benefit sound localization and generalize to further applications. All codes will be available.
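The transformation-equivariance constraint can be written as a consistency loss between a transformed prediction and a prediction on the transformed input (a generic sketch of the idea, not the paper's exact loss):

```python
import numpy as np

def equivariance_loss(localize, frame, transform):
    """Geometric-consistency term (a sketch): the localization map of a
    transformed frame should equal the transformed localization map of the
    original frame, i.e. localize(T(x)) should match T(localize(x))."""
    return np.abs(localize(transform(frame)) - transform(localize(frame))).mean()
```

For example, with a horizontal flip as the transform, a per-pixel localizer commutes with the flip and incurs zero loss, while a spatially biased one does not.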
Numerical reasoning over financial data, i.e., performing quantitative analysis and summarizing the information in financial reports, can significantly increase business efficiency and reduce costs by billions of dollars. Here, we present a numerical reasoning question-answering system that answers numerical reasoning questions over financial text and table data sources, consisting of a retriever module, a generator module, and an ensemble module. Specifically, in addition to retrieving whole rows of data, we innovatively design a cell retriever that retrieves the gold cells, to avoid bringing irrelevant and similar cells from the same row into the input of the generator module. In the generator module, we utilize multiple generators to produce programs, which are the operation steps for answering the question. Finally, in the ensemble module, we integrate the multiple programs to select the best program as the output of the system. On the final private test set of the FinQA competition, our system achieves an execution accuracy of 69.79.
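The programs produced by the generators can be executed by a tiny interpreter (a simplified sketch: the `#i` back-references follow the FinQA program style, but the reduced four-operation set and function name are our assumptions):

```python
def execute_program(steps):
    """Execute a FinQA-style program: a list of operation steps whose
    arguments are numbers or "#i" references to earlier step results."""
    ops = {"add": lambda a, b: a + b,
           "subtract": lambda a, b: a - b,
           "multiply": lambda a, b: a * b,
           "divide": lambda a, b: a / b}
    results = []
    for step in steps:
        name, args = step.split("(", 1)
        vals = []
        for tok in args.rstrip(")").split(","):
            tok = tok.strip()
            # "#i" resolves to the result of step i; otherwise parse a number.
            vals.append(results[int(tok[1:])] if tok.startswith("#") else float(tok))
        results.append(ops[name](*vals))
    return results[-1]  # the final step's value answers the question
```

For example, `execute_program(["subtract(200, 50)", "divide(#0, 100)"])` computes (200 - 50) / 100 = 1.5.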
Recent advances on the linear support vector machine with the 0-1 soft margin loss ($L_{0/1}$-SVM) show that the 0-1 loss problem can be solved directly. However, its theoretical and algorithmic requirements prevent us from extending the linear solving framework directly to its nonlinear kernel form; one major deficiency among them is the absence of an explicit expression for the Lagrangian dual function of the $L_{0/1}$-SVM. In this paper, by applying the nonparametric representation theorem, we propose a nonlinear model for the support vector machine with the 0-1 soft margin loss, called $L_{0/1}$-KSVM, which incorporates the kernel technique and, more importantly, builds on the success of systematically solving the linear task. Its optimality condition is explored theoretically, and a working-set-selection alternating direction method of multipliers (ADMM) algorithm is introduced to obtain its numerical solution. Moreover, we present the first closed-form definition of the support vector (SV) of $L_{0/1}$-KSVM. Theoretically, we prove that all SVs of $L_{0/1}$-KSVM are located only on the parallel decision surfaces. The experiments also show that $L_{0/1}$-KSVM has far fewer SVs, while maintaining decent prediction accuracy, compared with its linear peer $L_{0/1}$-SVM and six other nonlinear benchmark SVM classifiers.
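The 0-1 soft margin objective can be stated concretely for the linear case (a sketch of the standard formulation; the variable names and penalty constant `C` are our notation):

```python
import numpy as np

def l01_svm_objective(w, b, X, y, C=1.0):
    """Objective of the linear L_{0/1}-SVM (a sketch):

        0.5 * ||w||^2 + C * #(margin violations),

    where the 0-1 soft margin loss counts samples with y_i (w.x_i + b) < 1,
    rather than penalizing them proportionally as the hinge loss does.
    """
    margins = y * (X @ w + b)
    return 0.5 * np.dot(w, w) + C * np.count_nonzero(1.0 - margins > 0)
```

Unlike the hinge loss, this count is flat away from the margin boundary, which is why only points on the parallel decision surfaces can act as support vectors.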
Deep image prior (DIP) has shown remarkable potential for solving inverse problems in computer vision, without any extra training data. Practical DIP models are often substantially overparameterized. During the fitting process, these models first learn the desired visual content and then pick up the potential modeling and observational noise, i.e., they overfit. Thus, the practicality of DIP often depends critically on good early stopping (ES) that captures the transition period. In this regard, most DIP works on vision tasks only demonstrate the potential of the models: they report peak performance against the ground truth, but provide no clue as to how to operationally obtain near-peak performance without access to the ground truth. In this paper, we set out to break this practicality barrier of DIP and propose an effective ES strategy that consistently detects near-peak performance across several vision tasks and DIP variants. Based on a simple measure of dispersion of consecutive DIP reconstructions, our ES method not only outpaces the existing ones, which work only in very narrow regimes, but also remains effective when combined with methods that try to mitigate overfitting. The code is available at https://github.com/sun-umn/early_stopping_for_dip.
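The dispersion-based ES idea can be sketched as a windowed-variance monitor over the stream of reconstructions (our simplified reading; the window size, patience rule, and variance measure are assumptions, not the paper's exact criterion):

```python
from collections import deque
import numpy as np

def es_by_dispersion(recon_stream, window=5, patience=3):
    """Early stopping from dispersion of consecutive DIP reconstructions
    (a sketch): track the mean per-pixel variance over a sliding window of
    reconstructions and stop once it has failed to decrease for `patience`
    consecutive steps, i.e. once fitting stops converging and noise creeps in.
    """
    buf = deque(maxlen=window)
    best, stall = np.inf, 0
    for i, recon in enumerate(recon_stream):
        buf.append(np.asarray(recon, dtype=float))
        if len(buf) < window:
            continue  # wait until the window is full
        dispersion = np.stack(buf).var(axis=0).mean()
        if dispersion < best:
            best, stall = dispersion, 0
        else:
            stall += 1
            if stall >= patience:
                return i  # iteration at which to stop
    return None  # no transition detected
```

On a stream that converges and then turns noisy, the detector fires shortly after the dispersion curve bottoms out.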
Visual-language pre-training has shown great success in learning joint visual-textual representations from large-scale web data, demonstrating a remarkable ability for zero-shot generalization. This paper presents a simple method to efficiently adapt one pre-trained visual-language model to novel tasks with minimal training; here, we consider video understanding tasks. Specifically, we propose to optimize a few random vectors, termed continuous prompt vectors, that convert the novel tasks into the same format as the pre-training objectives. In addition, to bridge the gap between static images and videos, temporal information is encoded with lightweight Transformers stacked on top of the frame-wise visual features. Experimentally, we conduct extensive ablation studies to analyze the critical components. On 9 public benchmarks for action recognition, action localization, and text-video retrieval, across closed-set, few-shot, and open-set scenarios, we achieve competitive or state-of-the-art performance relative to existing methods, despite training significantly fewer parameters.
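Mechanically, the continuous prompt vectors amount to concatenating a few learnable embeddings with the token embeddings before the frozen text encoder (a minimal sketch; the shapes and function name are our assumptions):

```python
import numpy as np

def prepend_prompts(token_embs, prompt_vecs):
    """Continuous prompting (a sketch): learnable prompt vectors are simply
    concatenated with the token embeddings before the frozen text encoder,
    casting the new task into the pre-training input format.

    token_embs: (num_tokens, dim); prompt_vecs: (num_prompts, dim).
    Only prompt_vecs (and here, not the encoder) would receive gradients.
    """
    return np.concatenate([prompt_vecs, token_embs], axis=0)
```

Because only the prompt vectors (plus the lightweight temporal Transformer) are trained, the number of tuned parameters stays tiny compared with fine-tuning the full model.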